Welcome to our little project on fake-news detection using tree-based models, chosen mainly for their more naturally explainable structure. In the following steps, we will prepare the data as input for the models, then extract features and build the models. The dataset can be downloaded via the link below; it is an updated version of the dataset used in:
Ahmed, H., Traore, I., Saad, S. (2017). Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques. In: Traore, I., Woungang, I., Awad, A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol. 10618. Springer, Cham. https://doi.org/10.1007/978-3-319-69155-8_9
Our main goal is to gain insight into why the authors of the study did not pursue tree-based models further, given that their tree-based model scored only slightly worse than the less explainable alternatives. We also hope to explore the topic a bit more broadly along the way.
#!wget https://onlineacademiccommunity.uvic.ca/isot/wp-content/uploads/sites/7295/2023/03/News-_dataset.zip
#!unzip -o News-_dataset.zip
The two input files, Fake.csv and True.csv, contain exactly that: fake and true articles, respectively. A tabular overview is provided alongside the dataset.

For our task we rely mainly on pandas for in-memory analytics, along with a few ML libraries for the models.
import pandas as pd
df_fake = pd.read_csv("Fake.csv")
df_fake
|   | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |
| ... | ... | ... | ... | ... |
| 23476 | McPain: John McCain Furious That Iran Treated ... | 21st Century Wire says As 21WIRE reported earl... | Middle-east | January 16, 2016 |
| 23477 | JUSTICE? Yahoo Settles E-mail Privacy Class-ac... | 21st Century Wire says It s a familiar theme. ... | Middle-east | January 16, 2016 |
| 23478 | Sunnistan: US and Allied ‘Safe Zone’ Plan to T... | Patrick Henningsen 21st Century WireRemember ... | Middle-east | January 15, 2016 |
| 23479 | How to Blow $700 Million: Al Jazeera America F... | 21st Century Wire says Al Jazeera America will... | Middle-east | January 14, 2016 |
| 23480 | 10 U.S. Navy Sailors Held by Iranian Military ... | 21st Century Wire says As 21WIRE predicted in ... | Middle-east | January 12, 2016 |
23481 rows × 4 columns
You can see that the original dataset features 4 columns: the title, the full text, the subject, and the date. For our use case, we rely mainly on the full text rather than the title. One might argue that the title itself contains valuable information; however, lurid, out-of-context headlines appear even in more quality-oriented journalism and might therefore be misleading. We will remove the title later on, but keep it for now for context.
df_fake["fake"] = True
df_fake
|   | title | text | subject | date | fake |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | True |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | True |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | True |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | True |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | True |
| ... | ... | ... | ... | ... | ... |
| 23476 | McPain: John McCain Furious That Iran Treated ... | 21st Century Wire says As 21WIRE reported earl... | Middle-east | January 16, 2016 | True |
| 23477 | JUSTICE? Yahoo Settles E-mail Privacy Class-ac... | 21st Century Wire says It s a familiar theme. ... | Middle-east | January 16, 2016 | True |
| 23478 | Sunnistan: US and Allied ‘Safe Zone’ Plan to T... | Patrick Henningsen 21st Century WireRemember ... | Middle-east | January 15, 2016 | True |
| 23479 | How to Blow $700 Million: Al Jazeera America F... | 21st Century Wire says Al Jazeera America will... | Middle-east | January 14, 2016 | True |
| 23480 | 10 U.S. Navy Sailors Held by Iranian Military ... | 21st Century Wire says As 21WIRE predicted in ... | Middle-east | January 12, 2016 | True |
23481 rows × 5 columns
df_non_fake = pd.read_csv("True.csv")
df_non_fake["fake"] = False
df = pd.concat([df_non_fake, df_fake])
df
|   | title | text | subject | date | fake |
|---|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 | False |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 | False |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 | False |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 | False |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 | False |
| ... | ... | ... | ... | ... | ... |
| 23476 | McPain: John McCain Furious That Iran Treated ... | 21st Century Wire says As 21WIRE reported earl... | Middle-east | January 16, 2016 | True |
| 23477 | JUSTICE? Yahoo Settles E-mail Privacy Class-ac... | 21st Century Wire says It s a familiar theme. ... | Middle-east | January 16, 2016 | True |
| 23478 | Sunnistan: US and Allied ‘Safe Zone’ Plan to T... | Patrick Henningsen 21st Century WireRemember ... | Middle-east | January 15, 2016 | True |
| 23479 | How to Blow $700 Million: Al Jazeera America F... | 21st Century Wire says Al Jazeera America will... | Middle-east | January 14, 2016 | True |
| 23480 | 10 U.S. Navy Sailors Held by Iranian Military ... | 21st Century Wire says As 21WIRE predicted in ... | Middle-east | January 12, 2016 | True |
44898 rows × 5 columns
df.describe()
|   | title | text | subject | date | fake |
|---|---|---|---|---|---|
| count | 44898 | 44898 | 44898 | 44898 | 44898 |
| unique | 38729 | 38646 | 8 | 2397 | 2 |
| top | Factbox: Trump fills top jobs for his administ... |  | politicsNews | December 20, 2017 | True |
| freq | 14 | 627 | 11272 | 182 | 23481 |
Upon checking the data, what quickly stands out is that there are seemingly a lot of duplicate texts. Since duplicates might skew results and generally add little value, we will look into this first, possibly spotting other interesting points along the way.
df[df["text"].duplicated()].sort_values(by="text")
|   | title | text | subject | date | fake |
|---|---|---|---|---|---|
| 12655 | VIRAL VIDEO: A Must Watch Video Hillary Doesn’... |  | politics | Oct 22, 2016 | True |
| 12800 | HILLARY CLINTON RAPE ENABLER: “What kind of mo... |  | politics | Oct 8, 2016 | True |
| 12802 | BREAKING: DONALD TRUMP Video Statement On Leak... |  | politics | Oct 8, 2016 | True |
| 12825 | FUNNY! MSNBC ANCHOR ASKS Millennial Women If T... |  | politics | Oct 5, 2016 | True |
| 12831 | A MUST SEE! MEDIA SCORCHED FOR THEIR BIAS AGAI... |  | politics | Oct 4, 2016 | True |
| ... | ... | ... | ... | ... | ... |
| 19548 | CNN HOST To Jill Stein: “Have You Seen Any Dir... | https://youtu.be/E2KFe_htBSA And I think the f... | left-news | Nov 27, 2016 | True |
| 19524 | THE VIEW’S Whoopi Goldberg To Co-Host: “This I... | https://youtu.be/RTuxvWjH3a4 | left-news | Dec 1, 2016 | True |
| 19277 | TEMPERS FLARE IN DC: BIKERS FOR TRUMP Break Th... | https://youtu.be/ZfRYj2ZX3dE#Trump supporter g... | left-news | Jan 20, 2017 | True |
| 19619 | TRUMP SUPPORTER Whose Brutal Beating By Black ... | https://youtu.be/kKFQ5i9jXmA | left-news | Nov 14, 2016 | True |
| 19410 | OHIO ELECTOR TORCHES Anti-Trump Letters He Rec... | pic.twitter.com/KMnLrwB6t1 Richard K. Jones (... | left-news | Dec 21, 2016 | True |
6252 rows × 5 columns
Two things stand out: many text fields are empty, and a lot of entries consist of nothing but a click-bait title and a link, for example to YouTube. Since we cannot check the linked content and would effectively be classifying on the title alone, and since our use case is not focused on click-bait media, we should remove these entries. This could be handled by, e.g., either removing link-only articles or enforcing a lower bound on article length.
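As an illustration of the first option, here is a sketch of a regex-based filter for link-only texts. The pattern and the residual-length threshold are our own heuristics, not part of the original pipeline:

```python
import re

import pandas as pd

# Heuristic (our own): a text counts as "link-only" if almost nothing
# remains after stripping URLs and pic.twitter.com references.
URL_RE = re.compile(r"(?:https?://\S+|pic\.twitter\.com/\S+)")

def is_link_only(text: str, min_residual_chars: int = 40) -> bool:
    """True if the text is essentially just one or more links."""
    residual = URL_RE.sub("", text).strip()
    return len(residual) < min_residual_chars

sample = pd.Series([
    "https://youtu.be/kKFQ5i9jXmA",
    "WASHINGTON (Reuters) - The special counsel investigation continued on Monday as expected.",
])
print(sample.map(is_link_only).tolist())  # [True, False]
```

We go with the simpler length bound below, which catches these cases as a side effect.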
df["text_length"] = df["text"].apply(len)
df["text_length"].describe()
count    44898.000000
mean      2469.109693
std       2171.617091
min          1.000000
25%       1234.000000
50%       2186.000000
75%       3105.000000
max      51794.000000
Name: text_length, dtype: float64
Looking at the distribution of text lengths, there are generally very few short articles. Cutting all articles shorter than 100 characters seems like a good choice, as those will most likely just reference some other resource.
df = df[df["text_length"] >= 100]
df
|   | title | text | subject | date | fake | text_length |
|---|---|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 | False | 4659 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 | False | 4077 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 | False | 2789 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 | False | 2461 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 | False | 5204 |
| ... | ... | ... | ... | ... | ... | ... |
| 23476 | McPain: John McCain Furious That Iran Treated ... | 21st Century Wire says As 21WIRE reported earl... | Middle-east | January 16, 2016 | True | 3237 |
| 23477 | JUSTICE? Yahoo Settles E-mail Privacy Class-ac... | 21st Century Wire says It s a familiar theme. ... | Middle-east | January 16, 2016 | True | 1684 |
| 23478 | Sunnistan: US and Allied ‘Safe Zone’ Plan to T... | Patrick Henningsen 21st Century WireRemember ... | Middle-east | January 15, 2016 | True | 25065 |
| 23479 | How to Blow $700 Million: Al Jazeera America F... | 21st Century Wire says Al Jazeera America will... | Middle-east | January 14, 2016 | True | 2685 |
| 23480 | 10 U.S. Navy Sailors Held by Iranian Military ... | 21st Century Wire says As 21WIRE predicted in ... | Middle-east | January 12, 2016 | True | 5251 |
43858 rows × 6 columns
We started with 44898 rows; now 43858 are left, which seems reasonable. Let us now look at the remaining "duplicates".
df[df["text"].duplicated(keep=False)].sort_values(by="text")
|   | title | text | subject | date | fake | text_length |
|---|---|---|---|---|---|---|
| 13677 | MUSLIM INVASION OF AMERICA In Full Swing: Obam... | (Welcome) to America We hope you enjoy our... | politics | Jun 17, 2016 | True | 2494 |
| 16547 | MUSLIM INVASION OF AMERICA In Full Swing: Obam... | (Welcome) to America We hope you enjoy our... | Government News | Jun 17, 2016 | True | 2494 |
| 18505 | WOW! MAJOR CREDIT CARD COMPANY Still Sponsorin... | Delta Air Lines and Bank of America became ... | left-news | Jun 12, 2017 | True | 1889 |
| 10645 | WOW! MAJOR CREDIT CARD COMPANY Still Sponsorin... | Delta Air Lines and Bank of America became ... | politics | Jun 12, 2017 | True | 1889 |
| 16289 | A MUST WATCH! “It’s Time To Show America Is Bi... | #PresidentElectTrumpABSOLUTELY MUST WATCHTod... | Government News | Dec 21, 2016 | True | 198 |
| ... | ... | ... | ... | ... | ... | ... |
| 11234 | BONKERS BERNIE SANDERS: Prioritizing Jobs Over... | https://www.youtube.com/watch?v=GPqQIlWksbgVer... | politics | Apr 1, 2017 | True | 1463 |
| 19548 | CNN HOST To Jill Stein: “Have You Seen Any Dir... | https://youtu.be/E2KFe_htBSA And I think the f... | left-news | Nov 27, 2016 | True | 871 |
| 12285 | CNN HOST To Jill Stein: “Have You Seen Any Dir... | https://youtu.be/E2KFe_htBSA And I think the f... | politics | Nov 27, 2016 | True | 871 |
| 19277 | TEMPERS FLARE IN DC: BIKERS FOR TRUMP Break Th... | https://youtu.be/ZfRYj2ZX3dE#Trump supporter g... | left-news | Jan 20, 2017 | True | 405 |
| 11849 | TEMPERS FLARE IN DC: BIKERS FOR TRUMP Break Th... | https://youtu.be/ZfRYj2ZX3dE#Trump supporter g... | politics | Jan 20, 2017 | True | 405 |
10546 rows × 6 columns
We can see that we have already reduced the duplicates; however, there are still a lot of duplicates by text. The straightforward fix is to deduplicate. Pandas' DataFrame.drop_duplicates(keep="first") keeps the first occurrence of each duplicate group.
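A minimal illustration of the keep="first" behaviour on a toy frame:

```python
import pandas as pd

# A toy frame with one duplicated text
toy = pd.DataFrame({
    "text": ["same article", "same article", "unique article"],
    "fake": [True, True, False],
})

# keep="first" retains the first row of each duplicate group
deduped = toy.drop_duplicates(subset="text", keep="first")
print(deduped["text"].tolist())  # ['same article', 'unique article']
```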
df_cleaned = df.drop_duplicates(subset="text", keep="first")
df_cleaned["text"].duplicated().sum()
0
df_cleaned[df_cleaned["fake"] == True].count()
title          17157
text           17157
subject        17157
date           17157
fake           17157
text_length    17157
dtype: int64
df_cleaned[df_cleaned["fake"] == False].count()
title          21191
text           21191
subject        21191
date           21191
fake           21191
text_length    21191
dtype: int64
df_cleaned[df_cleaned["fake"] == False].groupby(by="subject").count()
| subject | title | text | date | fake | text_length |
|---|---|---|---|---|---|
| politicsNews | 11213 | 11213 | 11213 | 11213 | 11213 |
| worldnews | 9978 | 9978 | 9978 | 9978 | 9978 |
df_cleaned[df_cleaned["fake"] == True].groupby(by="subject").count()
| subject | title | text | date | fake | text_length |
|---|---|---|---|---|---|
| Government News | 498 | 498 | 498 | 498 | 498 |
| News | 9050 | 9050 | 9050 | 9050 | 9050 |
| US_News | 783 | 783 | 783 | 783 | 783 |
| left-news | 663 | 663 | 663 | 663 | 663 |
| politics | 6163 | 6163 | 6163 | 6163 | 6163 |
# Transform "fake" labels to fake = 1, not-fake = 0
df_cleaned["fake"] = df_cleaned["fake"].replace(to_replace=False, value=0)
df_cleaned["fake"] = df_cleaned["fake"].replace(to_replace=True, value=1)
/tmp/ipykernel_78769/1935809760.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_cleaned["fake"] = df_cleaned["fake"].replace(to_replace=False, value=0) /tmp/ipykernel_78769/1935809760.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_cleaned["fake"] = df_cleaned["fake"].replace(to_replace=True, value=1)
df_cleaned
|   | title | text | subject | date | fake | text_length |
|---|---|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 | 0 | 4659 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 | 0 | 4077 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 | 0 | 2789 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 | 0 | 2461 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 | 0 | 5204 |
| ... | ... | ... | ... | ... | ... | ... |
| 22698 | The White House and The Theatrics of ‘Gun Cont... | 21st Century Wire says All the world s a stage... | US_News | January 7, 2016 | 1 | 7359 |
| 22699 | Activists or Terrorists? How Media Controls an... | Randy Johnson 21st Century WireThe majority ... | US_News | January 7, 2016 | 1 | 26275 |
| 22700 | BOILER ROOM – No Surrender, No Retreat, Heads ... | Tune in to the Alternate Current Radio Network... | US_News | January 6, 2016 | 1 | 1150 |
| 22701 | Federal Showdown Looms in Oregon After BLM Abu... | 21st Century Wire says A new front has just op... | US_News | January 4, 2016 | 1 | 20651 |
| 22702 | A Troubled King: Chicago’s Rahm Emanuel Desper... | 21st Century Wire says It s not that far away.... | US_News | January 2, 2016 | 1 | 5749 |
38348 rows × 6 columns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Split the dataset into training and testing sets
train_data, test_data, train_labels, test_labels = train_test_split(df_cleaned['text'], df_cleaned['fake'], test_size=0.2, random_state=15)
# Create a TF-IDF vectorizer to convert text into numerical features
vectorizer = TfidfVectorizer(max_features=5000)
train_features = vectorizer.fit_transform(train_data)
test_features = vectorizer.transform(test_data)
# Train a decision tree classifier
tree_model = DecisionTreeClassifier()
tree_model.fit(train_features, train_labels)
# Extract feature names from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Visualize the decision tree
tree.plot_tree(tree_model, feature_names=feature_names)
# Get the most important features for classification
importance = tree_model.feature_importances_
important_features = [(feature_names[i], importance[i]) for i in range(len(feature_names))]
important_features.sort(key=lambda x: x[1], reverse=True)
# Print the top 10 important features
print("Top 10 important features:")
for feature, importance in important_features[:10]:
print(f"{feature}: {importance}")
Top 10 important features:
reuters: 0.9737763559972235
getty: 0.005966230401874472
image: 0.0019986892792346893
21wire: 0.0019524007710381203
via: 0.0010386570095431342
obama: 0.0009606127209026892
zika: 0.0007863500518846811
group: 0.00048198222170253165
hillary: 0.00047331728474720217
patrol: 0.00036966585166126775
train_data
21078 If you can t get an acting role in Hollywood, ...
17856 NAIROBI (Reuters) - Kenyan opposition leader R...
6624 NEW YORK (Reuters) - The FBI acted inappropria...
6258 Donald Trump has been getting heavy media cove...
2012 The following statements were posted to the ve...
...
9221 The exclusive below from Fox news skips the fa...
14890 // <![CDATA[ (function(d, s, id) { var js, fjs...
2706 MONTEREY, Calif. (Reuters) - The California Pu...
8123 WASHINGTON (Reuters) - U.S. Republican preside...
7670 WASHINGTON (Reuters) - A bipartisan group of l...
Name: text, Length: 30678, dtype: object
The first run seems quite successful. However, the single most important feature is "reuters", which in many cases appears at the beginning of the text. Since the dataset under study sourced most of its true articles from Reuters, this is no real surprise. We will have to remove it later; first, let's see how the tree model did.
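To quantify such a leak, one can measure the fraction of articles per class whose text contains the marker. A sketch, demonstrated on a hypothetical toy frame (in the notebook it would be applied to df_cleaned):

```python
import pandas as pd

def marker_rate_by_class(df: pd.DataFrame, marker: str = "(Reuters)") -> pd.Series:
    """Fraction of articles per class whose text contains the marker."""
    return df["text"].str.contains(marker, regex=False).groupby(df["fake"]).mean()

# Toy frame for illustration; in the notebook: marker_rate_by_class(df_cleaned)
toy = pd.DataFrame({
    "text": ["WASHINGTON (Reuters) - ...", "LONDON (Reuters) - ...", "Donald Trump just ..."],
    "fake": [0, 0, 1],
})
rates = marker_rate_by_class(toy)
print(rates)
```

A rate near 1.0 for one class and near 0.0 for the other indicates a near-perfect shortcut feature.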
pred = tree_model.predict(test_features)
import numpy as np
print("RMS: %r " % np.sqrt(np.mean((pred - test_labels) ** 2)))
RMS: 0.07130740328122925
loc_test_labels = test_labels.reset_index()
sum_correct = 0
for i in range(len(pred)):  # iterate over all predictions, including the last one
    if pred[i] == loc_test_labels["fake"][i]:
        sum_correct += 1
sum_correct
7630
len(pred)
7670
sum_correct / len(pred) # fraction of test items classified correctly (≈99.5%)
0.9947848761408083
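The manual counting loop above can be replaced by scikit-learn's built-in accuracy metric; a sketch with hypothetical toy labels (in the notebook it would be accuracy_score(test_labels, pred)):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels for illustration; 4 of 5 predictions match
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])
print(accuracy_score(y_true, y_pred))  # 0.8
```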
It appears that the prediction is almost 100% accurate, which is no wonder given the "reuters" keyword. One may wonder whether the authors of the study removed it in the course of their paper. That this artifact is still in the dataset is a key argument for explainable models in general: with biased data and a non-explainable model, we would never have stumbled upon it. Such leakage happens frequently in research as well as in industry, with serious consequences.
Because the keyword is in the data, we want to remove it first, but let us briefly review what the model above implements feature-wise. The reference study (cited above) used stop-word removal, stemming, and TF-IDF vectorization to reach its reported accuracies; the model above implements only the TF-IDF vectorization.
As a next step, we naively remove the Reuters keyword, cutting "(Reuters)" and similar strings from the text. Then we rerun the same TF-IDF-only model and compare.
df_cleaned
|   | title | text | subject | date | fake | text_length |
|---|---|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 | 0 | 4659 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 | 0 | 4077 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 | 0 | 2789 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 | 0 | 2461 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 | 0 | 5204 |
| ... | ... | ... | ... | ... | ... | ... |
| 22698 | The White House and The Theatrics of ‘Gun Cont... | 21st Century Wire says All the world s a stage... | US_News | January 7, 2016 | 1 | 7359 |
| 22699 | Activists or Terrorists? How Media Controls an... | Randy Johnson 21st Century WireThe majority ... | US_News | January 7, 2016 | 1 | 26275 |
| 22700 | BOILER ROOM – No Surrender, No Retreat, Heads ... | Tune in to the Alternate Current Radio Network... | US_News | January 6, 2016 | 1 | 1150 |
| 22701 | Federal Showdown Looms in Oregon After BLM Abu... | 21st Century Wire says A new front has just op... | US_News | January 4, 2016 | 1 | 20651 |
| 22702 | A Troubled King: Chicago’s Rahm Emanuel Desper... | 21st Century Wire says It s not that far away.... | US_News | January 2, 2016 | 1 | 5749 |
38348 rows × 6 columns
df_cleaned["text"] = df_cleaned["text"].map(lambda x: x.replace("(Reuters)", ""))
/tmp/ipykernel_78769/2126419442.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_cleaned["text"] = df_cleaned["text"].map(lambda x: x.replace("(Reuters)", ""))
df_cleaned["text"] = df_cleaned["text"].map(lambda x: x.replace("Reuters", ""))
/tmp/ipykernel_78769/3751951457.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_cleaned["text"] = df_cleaned["text"].map(lambda x: x.replace("Reuters", ""))
df_cleaned["text"] = df_cleaned["text"].map(lambda x: x.replace("reuters", ""))
/tmp/ipykernel_78769/4158596274.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_cleaned["text"] = df_cleaned["text"].map(lambda x: x.replace("reuters", ""))
df_cleaned
|   | title | text | subject | date | fake | text_length |
|---|---|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON - The head of a conservative Repub... | politicsNews | December 31, 2017 | 0 | 4659 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON - Transgender people will be allow... | politicsNews | December 29, 2017 | 0 | 4077 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON - The special counsel investigatio... | politicsNews | December 31, 2017 | 0 | 2789 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON - Trump campaign adviser George Pa... | politicsNews | December 30, 2017 | 0 | 2461 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON - President Donald Trump c... | politicsNews | December 29, 2017 | 0 | 5204 |
| ... | ... | ... | ... | ... | ... | ... |
| 22698 | The White House and The Theatrics of ‘Gun Cont... | 21st Century Wire says All the world s a stage... | US_News | January 7, 2016 | 1 | 7359 |
| 22699 | Activists or Terrorists? How Media Controls an... | Randy Johnson 21st Century WireThe majority ... | US_News | January 7, 2016 | 1 | 26275 |
| 22700 | BOILER ROOM – No Surrender, No Retreat, Heads ... | Tune in to the Alternate Current Radio Network... | US_News | January 6, 2016 | 1 | 1150 |
| 22701 | Federal Showdown Looms in Oregon After BLM Abu... | 21st Century Wire says A new front has just op... | US_News | January 4, 2016 | 1 | 20651 |
| 22702 | A Troubled King: Chicago’s Rahm Emanuel Desper... | 21st Century Wire says It s not that far away.... | US_News | January 2, 2016 | 1 | 5749 |
38348 rows × 6 columns
As a little heads-up: we removed the keyword, but the dateline location is still mentioned at the start of most Reuters articles. We will see whether the locations now get flagged as highly important features for the "true" class...
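A stricter cleanup would strip the entire leading dateline rather than just the keyword, so the location tokens cannot leak either. A sketch using our own heuristic regex, applied to the original texts (before the keyword removal above):

```python
import re

# Heuristic (our own): Reuters articles open with "CITY[/CITY, State] (Reuters) - ".
# Stripping the whole dateline prevents the location tokens from leaking.
DATELINE_RE = re.compile(r"^[A-Z][A-Za-z .,/-]* \(Reuters\) - ")

def strip_dateline(text: str) -> str:
    """Remove a leading Reuters dateline, if present; otherwise return text unchanged."""
    return DATELINE_RE.sub("", text)

print(strip_dateline("SEATTLE/WASHINGTON (Reuters) - President Donald Trump called..."))
# President Donald Trump called...
```

For now we stick with the naive keyword removal and inspect what the model latches onto next.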
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Split the dataset into training and testing sets
train_data, test_data, train_labels, test_labels = train_test_split(df_cleaned['text'], df_cleaned['fake'], test_size=0.2, random_state=15)
# Create a TF-IDF vectorizer to convert text into numerical features
vectorizer = TfidfVectorizer(max_features=5000)
train_features = vectorizer.fit_transform(train_data)
test_features = vectorizer.transform(test_data)
# Train a decision tree classifier
tree_model = DecisionTreeClassifier()
tree_model.fit(train_features, train_labels)
# Extract feature names from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Visualize the decision tree
tree.plot_tree(tree_model, feature_names=feature_names)
# Get the most important features for classification
importance = tree_model.feature_importances_
important_features = [(feature_names[i], importance[i]) for i in range(len(feature_names))]
important_features.sort(key=lambda x: x[1], reverse=True)
# Print the top 10 important features
print("Top 10 important features:")
for feature, importance in important_features[:10]:
print(f"{feature}: {importance}")
Top 10 important features:
via: 0.3736217156491609
said: 0.22057425618473947
you: 0.03710482407092774
read: 0.03469160153895598
featured: 0.024043081111359076
on: 0.019710078950901183
pic: 0.01886471493556264
this: 0.01678746515061844
https: 0.010358756509108275
com: 0.010351598843464656
pred = tree_model.predict(test_features)
loc_test_labels = test_labels.reset_index()
sum_correct = 0
for i in range(len(pred)):  # iterate over all predictions, including the last one
    if pred[i] == loc_test_labels["fake"][i]:
        sum_correct += 1
sum_correct / len(pred) # fraction of test items classified correctly (≈93.2%)
0.9323337679269883
The intermediate model, with the Reuters keyword removed, has a new champion among the important features: the word "via". We will quickly look into this manually and check whether, like "reuters", it is used in a way that is tangential to the actual use case.
df.iloc[5,1]
'WEST PALM BEACH, Fla./WASHINGTON (Reuters) - The White House said on Friday it was set to kick off talks next week with Republican and Democratic congressional leaders on immigration policy, government spending and other issues that need to be wrapped up early in the new year. The expected flurry of legislative activity comes as Republicans and Democrats begin to set the stage for midterm congressional elections in November. President Donald Trump’s Republican Party is eager to maintain control of Congress while Democrats look for openings to wrest seats away in the Senate and the House of Representatives. On Wednesday, Trump’s budget chief Mick Mulvaney and legislative affairs director Marc Short will meet with Senate Majority Leader Mitch McConnell and House Speaker Paul Ryan - both Republicans - and their Democratic counterparts, Senator Chuck Schumer and Representative Nancy Pelosi, the White House said. That will be followed up with a weekend of strategy sessions for Trump, McConnell and Ryan on Jan. 6 and 7 at the Camp David presidential retreat in Maryland, according to the White House. The Senate returns to work on Jan. 3 and the House on Jan. 8. Congress passed a short-term government funding bill last week before taking its Christmas break, but needs to come to an agreement on defense spending and various domestic programs by Jan. 19, or the government will shut down. Also on the agenda for lawmakers is disaster aid for people hit by hurricanes in Puerto Rico, Texas and Florida, and by wildfires in California. The House passed an $81 billion package in December, which the Senate did not take up. The White House has asked for a smaller figure, $44 billion. 
Deadlines also loom for soon-to-expire protections for young adult immigrants who entered the country illegally as children, known as “Dreamers.” In September, Trump ended Democratic former President Barack Obama’s Deferred Action for Childhood Arrivals (DACA) program, which protected Dreamers from deportation and provided work permits, effective in March, giving Congress until then to devise a long-term solution. Democrats, some Republicans and a number of large companies have pushed for DACA protections to continue. Trump and other Republicans have said that will not happen without Congress approving broader immigration policy changes and tougher border security. Democrats oppose funding for a wall promised by Trump along the U.S.-Mexican border. “The Democrats have been told, and fully understand, that there can be no DACA without the desperately needed WALL at the Southern Border and an END to the horrible Chain Migration & ridiculous Lottery System of Immigration etc,” Trump said in a Twitter post on Friday. Trump wants to overhaul immigration rules for extended families and others seeking to live in the United States. Republican U.S. Senator Jeff Flake, a frequent critic of the president, said he would work with Trump to protect Dreamers. “We can fix DACA in a way that beefs up border security, stops chain migration for the DREAMers, and addresses the unfairness of the diversity lottery. If POTUS (Trump) wants to protect these kids, we want to help him keep that promise,” Flake wrote on Twitter. Congress in early 2018 also must raise the U.S. debt ceiling to avoid a government default. The U.S. Treasury would exhaust all of its borrowing options and run dry of cash to pay its bills by late March or early April if Congress does not raise the debt ceiling before then, according to the nonpartisan Congressional Budget Office. 
Trump, who won his first major legislative victory with the passage of a major tax overhaul this month, has also promised a major infrastructure plan. '
df[df['text'].str.contains(' via ')]["text"].iloc[10]
'BEIJING (Reuters) - U.S. President Donald Trump went around and over the “Great Firewall” of China in a late-night tweet in Beijing as he thanked his hosts for a rare tour of the Forbidden City and a private dinner at the sprawling, centuries-old palace complex. Many Western social media platforms such as Twitter and Facebook are banned in China. A sophisticated system has been built to deny online users within China access to blocked content. That was not an issue for Trump, known for tweeting to his 42.3 million followers at any hour of the day, on Wednesday, the day he arrived in Beijing. “On behalf of @FLOTUS Melania and I, THANK YOU for an unforgettable afternoon and evening at the Forbidden City in Beijing, President Xi and Madame Peng Liyuan. We are looking forward to rejoining you tomorrow morning!” Trump even changed his Twitter banner, uploading a photograph of himself and Melania with Chinese President Xi Jinping and his wife, Peng Liyuan, during a Chinese opera performance at the Forbidden City. The Twitter banner upload did not go unnoticed by Chinese state media, with state broadcaster CCTV flashing screenshots of the photograph on Thursday. Trump’s visit was also the third-most talked-about topic on Chinese social media platform Weibo over the last 24 hours, trailing only the birthday of a singer in a Chinese boy band and a weekly Asian pop song chart. Many people wondered how Trump managed to evade China’s tough internet controls. “I guess he must have done it via wifi on a satellite network,” said a user on Weibo. Many foreigners log on to virtual private networks (VPNs) to access content hosted outside of China. Another option is to sign up for a data-roaming service before leaving one’s home country. “The president will tweet whatever he wants. That’s his way of communicating directly with the American people. Why not?” a White House official said ahead of Trump’s arrival in Beijing on Wednesday. 
When asked whether China considers Trump’s use of Twitter to be in breach of Chinese law, Foreign Ministry spokeswoman Hua Chunying said there were many means of communication with “the outside world”. “In China, people have many channels to communicate, it’s just that they communicate in different ways,” Hua said at a regular ministry briefing. “For example, some people use WeChat, some people use Weibo. Some people use Apple phones, some people use Huawei phones.” Trump tweeted again on Thursday afternoon, posting an ABC News video montage of the “incredible” welcome parade at the Great Hall of the People, where he was greeted by a military band and jumping, flag-waving children. In his tweet, Trump embedded a link to a photograph of his Beijing visit on Instagram - also forbidden in China. Not all of Trump’s tweets in China were bright and cheerful. “NoKo has interpreted America’s past restraint as weakness,” he tweeted about reclusive North Korea’s nuclear and missile threats. “This would be a fatal miscalculation. Do not underestimate us. AND DO NOT TRY US.” '
Manual inspection of the uncleaned text did not reveal any obvious artifact behind this. So "via" is probably just used as an ordinary word that fake news articles tend to avoid (it often signals attribution, i.e. proof or references). Interestingly, locations such as "Beijing" in the example above do not seem to play a role.
df_cleaned["text"]
0 WASHINGTON - The head of a conservative Repub...
1 WASHINGTON - Transgender people will be allow...
2 WASHINGTON - The special counsel investigatio...
3 WASHINGTON - Trump campaign adviser George Pa...
4 SEATTLE/WASHINGTON - President Donald Trump c...
...
22698 21st Century Wire says All the world s a stage...
22699 Randy Johnson 21st Century WireThe majority ...
22700 Tune in to the Alternate Current Radio Network...
22701 21st Century Wire says A new front has just op...
22702 21st Century Wire says It s not that far away....
Name: text, Length: 38348, dtype: object
From a data cleaning standpoint, this dataset now looks quite solid, especially since we removed the give-away indicator "Reuters". The next step is feature selection, specifically applying the methods used in the paper, which we will quickly list and explain here:
Stop Word Removal:
Insignificant words such as "about" or "that" may create noise in models like ours. They are very common and can be especially "noisy" in n-gram-based analysis. In the paper, the authors state that they "removed common words such as, a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or, that, the, these, this, too, was, what, when, where, who, will, etc.".
As this list is not fully specified and we do not have the code the authors used (we asked for it via e-mail), we rely on scikit-learn's preselected list of stop words instead. The option stop_words = 'english' uses this prebuilt list.
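To see what this option actually filters, scikit-learn exposes the prebuilt list as ENGLISH_STOP_WORDS; the snippet below is a quick standalone check, not part of the pipeline:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# The frozen set behind stop_words='english'
print(len(ENGLISH_STOP_WORDS))          # size of the built-in list
print('via' in ENGLISH_STOP_WORDS)      # 'via' is on the list
print('reuters' in ENGLISH_STOP_WORDS)  # domain-specific words are not
```

Checks like this are useful before training, since any word on the list silently disappears from the feature space.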
The following code will rerun the model above with just this addition in order to compare and contrast the change.
# Split the dataset into training and testing sets
train_data, test_data, train_labels, test_labels = train_test_split(df_cleaned['text'], df_cleaned['fake'], test_size=0.2, random_state=15)
# Create a TF-IDF vectorizer to convert text into numerical features
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
train_features = vectorizer.fit_transform(train_data)
test_features = vectorizer.transform(test_data)
# Train a decision tree classifier
tree_model = DecisionTreeClassifier()
tree_model.fit(train_features, train_labels)
# Extract feature names from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Visualize the decision tree
tree.plot_tree(tree_model, feature_names=feature_names)
# Get the most important features for classification
importance = tree_model.feature_importances_
important_features = [(feature_names[i], importance[i]) for i in range(len(feature_names))]
important_features.sort(key=lambda x: x[1], reverse=True)
# Print the top 10 important features
print("Top 10 important features:")
for feature, importance in important_features[:10]:
    print(f"{feature}: {importance}")
Top 10 important features:
said: 0.36459215533943434
image: 0.18162157269770549
read: 0.038377607455185106
minister: 0.028863064188978608
washington: 0.02378511619609676
just: 0.02144585968585312
pic: 0.01743437871586469
com: 0.015059904951823782
didn: 0.014563184290509245
https: 0.00940473852375553
pred = tree_model.predict(test_features)
loc_test_labels = test_labels.reset_index()
sum_correct = 0
for i in range(len(pred)):  # iterate over all predictions (the earlier -1 skipped the last item)
    if pred[i] == loc_test_labels["fake"][i]:
        sum_correct += 1
sum_correct / len(pred)  # ~92.5% of test items predicted correctly
0.9245110821382008
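The manual counting loop above can also be replaced by scikit-learn's accuracy_score, which computes the same fraction of correct predictions in one call; a minimal sketch on toy labels:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

# Fraction of positions where prediction matches the label: 3 out of 4
print(accuracy_score(y_true, y_pred))  # 0.75
```

In our pipeline this would simply be accuracy_score(test_labels, pred).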
In a trial-and-error fashion, we found that applying stop word removal alone lowers the accuracy by around a full percentage point and causes the "via" feature to disappear. Upon looking it up, "via" really is on the stop-word list. We accept this for now and move on to other techniques. As a little heads-up: we looked at the literature on text pre-processing and found the following graphic overview of commonly used techniques at https://towardsdatascience.com/elegant-text-pre-processing-with-nltk-in-sklearn-pipeline-d6fe18b91eb8

Note that the paper generally uses some of the older techniques and does not mention noise removal or more advanced methods such as lemmatization.
Moving on to the next step in the paper: stemming, specifically the Porter stemmer, which the authors describe as the most commonly used stemming algorithm due to its accuracy (at the time of writing!). The NLTK package offers an implementation of the Porter algorithm, which we will use.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
df_cleaned["text_stemmed"] = df_cleaned['text'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
/tmp/ipykernel_78769/3504799208.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned["text_stemmed"] = df_cleaned['text'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
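This SettingWithCopyWarning appears because df_cleaned was derived from another DataFrame by filtering. A common fix, sketched below on a small hypothetical frame, is to take an explicit copy before adding columns:

```python
import pandas as pd

df = pd.DataFrame({"text": ["A first article", "Another one"], "fake": [0, 1]})

# Taking an explicit .copy() of the filtered slice avoids the warning,
# because the new column is then set on an independent DataFrame.
df_cleaned = df[df["fake"] == 0].copy()
df_cleaned["text_stemmed"] = df_cleaned["text"].str.lower()
print(df_cleaned["text_stemmed"].iloc[0])  # "a first article"
```

The warning is harmless here since we never use the original frame again, but the .copy() pattern makes the intent explicit.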
df_cleaned["text"].iloc[1]
'WASHINGTON - Transgender people will be allowed for the first time to enlist in the U.S. military starting on Monday as ordered by federal courts, the Pentagon said on Friday, after President Donald Trump’s administration decided not to appeal rulings that blocked his transgender ban. Two federal appeals courts, one in Washington and one in Virginia, last week rejected the administration’s request to put on hold orders by lower court judges requiring the military to begin accepting transgender recruits on Jan. 1. A Justice Department official said the administration will not challenge those rulings. “The Department of Defense has announced that it will be releasing an independent study of these issues in the coming weeks. So rather than litigate this interim appeal before that occurs, the administration has decided to wait for DOD’s study and will continue to defend the president’s lawful authority in District Court in the meantime,” the official said, speaking on condition of anonymity. In September, the Pentagon said it had created a panel of senior officials to study how to implement a directive by Trump to prohibit transgender individuals from serving. The Defense Department has until Feb. 21 to submit a plan to Trump. Lawyers representing currently-serving transgender service members and aspiring recruits said they had expected the administration to appeal the rulings to the conservative-majority Supreme Court, but were hoping that would not happen. Pentagon spokeswoman Heather Babb said in a statement: “As mandated by court order, the Department of Defense is prepared to begin accessing transgender applicants for military service Jan. 1. 
All applicants must meet all accession standards.” Jennifer Levi, a lawyer with gay, lesbian and transgender advocacy group GLAD, called the decision not to appeal “great news.” “I’m hoping it means the government has come to see that there is no way to justify a ban and that it’s not good for the military or our country,” Levi said. Both GLAD and the American Civil Liberties Union represent plaintiffs in the lawsuits filed against the administration. In a move that appealed to his hard-line conservative supporters, Trump announced in July that he would prohibit transgender people from serving in the military, reversing Democratic President Barack Obama’s policy of accepting them. Trump said on Twitter at the time that the military “cannot be burdened with the tremendous medical costs and disruption that transgender in the military would entail.” Four federal judges - in Baltimore, Washington, D.C., Seattle and Riverside, California - have issued rulings blocking Trump’s ban while legal challenges to the Republican president’s policy proceed. The judges said the ban would likely violate the right under the U.S. Constitution to equal protection under the law. The Pentagon on Dec. 8 issued guidelines to recruitment personnel in order to enlist transgender applicants by Jan. 1. The memo outlined medical requirements and specified how the applicants’ sex would be identified and even which undergarments they would wear. The Trump administration previously said in legal papers that the armed forces were not prepared to train thousands of personnel on the medical standards needed to process transgender applicants and might have to accept “some individuals who are not medically fit for service.” The Obama administration had set a deadline of July 1, 2017, to begin accepting transgender recruits. But Trump’s defense secretary, James Mattis, postponed that date to Jan. 1, 2018, which the president’s ban then put off indefinitely. 
Trump has taken other steps aimed at rolling back transgender rights. In October, his administration said a federal law banning gender-based workplace discrimination does not protect transgender employees, reversing another Obama-era position. In February, Trump rescinded guidance issued by the Obama administration saying that public schools should allow transgender students to use the restroom that corresponds to their gender identity. '
df_cleaned["text_stemmed"].iloc[1]
'washington - transgend peopl will be allow for the first time to enlist in the u.s. militari start on monday as order by feder courts, the pentagon said on friday, after presid donald trump’ administr decid not to appeal rule that block hi transgend ban. two feder appeal courts, one in washington and one in virginia, last week reject the administration’ request to put on hold order by lower court judg requir the militari to begin accept transgend recruit on jan. 1. a justic depart offici said the administr will not challeng those rulings. “the depart of defens ha announc that it will be releas an independ studi of these issu in the come weeks. so rather than litig thi interim appeal befor that occurs, the administr ha decid to wait for dod’ studi and will continu to defend the president’ law author in district court in the meantime,” the offici said, speak on condit of anonymity. in september, the pentagon said it had creat a panel of senior offici to studi how to implement a direct by trump to prohibit transgend individu from serving. the defens depart ha until feb. 21 to submit a plan to trump. lawyer repres currently-serv transgend servic member and aspir recruit said they had expect the administr to appeal the rule to the conservative-major suprem court, but were hope that would not happen. pentagon spokeswoman heather babb said in a statement: “a mandat by court order, the depart of defens is prepar to begin access transgend applic for militari servic jan. 1. all applic must meet all access standards.” jennif levi, a lawyer with gay, lesbian and transgend advocaci group glad, call the decis not to appeal “great news.” “i’m hope it mean the govern ha come to see that there is no way to justifi a ban and that it’ not good for the militari or our country,” levi said. both glad and the american civil liberti union repres plaintiff in the lawsuit file against the administration. 
in a move that appeal to hi hard-lin conserv supporters, trump announc in juli that he would prohibit transgend peopl from serv in the military, revers democrat presid barack obama’ polici of accept them. trump said on twitter at the time that the militari “cannot be burden with the tremend medic cost and disrupt that transgend in the militari would entail.” four feder judg - in baltimore, washington, d.c., seattl and riverside, california - have issu rule block trump’ ban while legal challeng to the republican president’ polici proceed. the judg said the ban would like violat the right under the u.s. constitut to equal protect under the law. the pentagon on dec. 8 issu guidelin to recruit personnel in order to enlist transgend applic by jan. 1. the memo outlin medic requir and specifi how the applicants’ sex would be identifi and even which undergar they would wear. the trump administr previous said in legal paper that the arm forc were not prepar to train thousand of personnel on the medic standard need to process transgend applic and might have to accept “some individu who are not medic fit for service.” the obama administr had set a deadlin of juli 1, 2017, to begin accept transgend recruits. but trump’ defens secretary, jame mattis, postpon that date to jan. 1, 2018, which the president’ ban then put off indefinitely. trump ha taken other step aim at roll back transgend rights. in october, hi administr said a feder law ban gender-bas workplac discrimin doe not protect transgend employees, revers anoth obama-era position. in february, trump rescind guidanc issu by the obama administr say that public school should allow transgend student to use the restroom that correspond to their gender identity.'
# Split the dataset into training and testing sets
train_data, test_data, train_labels, test_labels = train_test_split(df_cleaned['text_stemmed'], df_cleaned['fake'], test_size=0.2, random_state=15)
# Create a TF-IDF vectorizer to convert text into numerical features
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
train_features = vectorizer.fit_transform(train_data)
test_features = vectorizer.transform(test_data)
# Train a decision tree classifier
tree_model = DecisionTreeClassifier()
tree_model.fit(train_features, train_labels)
# Extract feature names from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Visualize the decision tree
tree.plot_tree(tree_model, feature_names=feature_names)
# Get the most important features for classification
importance = tree_model.feature_importances_
important_features = [(feature_names[i], importance[i]) for i in range(len(feature_names))]
important_features.sort(key=lambda x: x[1], reverse=True)
# Print the top 10 important features
print("Top 10 important features:")
for feature, importance in important_features[:10]:
    print(f"{feature}: {importance}")
Top 10 important features:
said: 0.36124274622280655
th: 0.1721453257137617
imag: 0.13947697039839485
thi: 0.024336756121572507
com: 0.016880920457508365
https: 0.014588657466668521
washington: 0.010993220478723352
featur: 0.010899856336458196
minist: 0.01075947759642547
watch: 0.010046320346391833
pred = tree_model.predict(test_features)
loc_test_labels = test_labels.reset_index()
sum_correct = 0
for i in range(len(pred)):  # iterate over all predictions (the earlier -1 skipped the last item)
    if pred[i] == loc_test_labels["fake"][i]:
        sum_correct += 1
sum_correct / len(pred)  # ~95.1% of test items predicted correctly
0.9511082138200783
The resulting stemmed column is now used as input for the TF-IDF vectorizer and the model in general. You can see the new model and accuracy above. Please note that the stemming step is quite resource-intensive; we had to move to a more powerful machine for it, so running the notebook locally may not be feasible on every machine.
Next up, the paper incorporates 5-fold cross-validation, which helps ensure that the results are stable. We will incorporate this below.
from sklearn.model_selection import cross_val_score
# Create a TF-IDF vectorizer to convert text into numerical features
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(df_cleaned["text_stemmed"])
y = df_cleaned["fake"]
# Train a decision tree classifier using 5-fold-cross-validation
tree_model = DecisionTreeClassifier()
scores = cross_val_score(tree_model, X, y, cv=5)
# Print the cross-validation accuracy scores
print("Cross-Validation Accuracy Scores:")
for fold, score in enumerate(scores, start=1):
    print(f"Fold {fold}: {score}")
Cross-Validation Accuracy Scores:
Fold 1: 0.9469361147327249
Fold 2: 0.9498044328552803
Fold 3: 0.9311603650586702
Fold 4: 0.9183726691876385
Fold 5: 0.9205893858390924
scores.mean()
0.9333725935346813
We can see that the average accuracy dropped a little, but it remains quite high. As a first reference to the study's results: the highest accuracy achieved there with decision trees was 89.0 percent, using TF-IDF in a unigram setting. This means that our model scores noticeably better than the model in the study, without even raising the maximum number of features or tweaking additional hyperparameters. The two best-scoring decision tree models in the study used 10,000 and 50,000 maximum features, so we suggest testing those next, again with k-fold cross-validation.
# Create a TF-IDF vectorizer to convert text into numerical features
vectorizer = TfidfVectorizer(max_features=10000, stop_words='english')
X = vectorizer.fit_transform(df_cleaned["text_stemmed"])
y = df_cleaned["fake"]
# Train a decision tree classifier using 5-fold-cross-validation
tree_model = DecisionTreeClassifier()
scores = cross_val_score(tree_model, X, y, cv=5)
# Print the cross-validation accuracy scores
print("Cross-Validation Accuracy Scores:")
for fold, score in enumerate(scores, start=1):
    print(f"Fold {fold}: {score}")
print("Average Accuracy: ", scores.mean())
Cross-Validation Accuracy Scores:
Fold 1: 0.9569752281616688
Fold 2: 0.9494132985658409
Fold 3: 0.9246414602346805
Fold 4: 0.919285434867649
Fold 5: 0.920067805450515
Average Accuracy:  0.9340766454560707
# Create a TF-IDF vectorizer to convert text into numerical features
vectorizer = TfidfVectorizer(max_features=50000, stop_words='english')
X = vectorizer.fit_transform(df_cleaned["text_stemmed"])
y = df_cleaned["fake"]
# Train a decision tree classifier using 5-fold-cross-validation
tree_model = DecisionTreeClassifier()
scores = cross_val_score(tree_model, X, y, cv=5)
# Print the cross-validation accuracy scores
print("Cross-Validation Accuracy Scores:")
for fold, score in enumerate(scores, start=1):
    print(f"Fold {fold}: {score}")
print("Average Accuracy: ", scores.mean())
Cross-Validation Accuracy Scores:
Fold 1: 0.9550195567144719
Fold 2: 0.9491525423728814
Fold 3: 0.926857887874837
Fold 4: 0.913287260399009
Fold 5: 0.9217629417133916
Average Accuracy:  0.9332160378149181
There does not seem to be a huge difference; a minimal one at best. The only setting from the study that we have not touched yet is the n-grams. Let's try bigrams first using the 5k-features model; among all the n-gram comparisons in the study, bigrams scored best on average.
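To illustrate what ngram_range actually produces, the vectorizer's analyzer can be applied to a toy sentence; a quick standalone check, independent of our pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the preprocessing + tokenization + n-gram
# expansion that the vectorizer applies internally to each document
analyzer = TfidfVectorizer(ngram_range=(2, 2)).build_analyzer()
print(analyzer("the quick brown fox"))
# ['the quick', 'quick brown', 'brown fox']
```

With ngram_range=(2, 2), every feature is a pair of adjacent words rather than a single word, which is why the vocabulary changes completely.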
# Create a TF-IDF vectorizer to convert text into numerical features
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(2, 2))
X = vectorizer.fit_transform(df_cleaned["text_stemmed"])
y = df_cleaned["fake"]
# Train a decision tree classifier using 5-fold-cross-validation
tree_model = DecisionTreeClassifier()
scores = cross_val_score(tree_model, X, y, cv=5)
# Print the cross-validation accuracy scores
print("Cross-Validation Accuracy Scores:")
for fold, score in enumerate(scores, start=1):
    print(f"Fold {fold}: {score}")
print("Average Accuracy: ", scores.mean())
Cross-Validation Accuracy Scores:
Fold 1: 0.9466753585397654
Fold 2: 0.8955671447196871
Fold 3: 0.8934810951760104
Fold 4: 0.873516755769983
Fold 5: 0.8590428999869605
Average Accuracy:  0.8936566508384812
# Create a TF-IDF vectorizer to convert text into numerical features
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(3, 3))
X = vectorizer.fit_transform(df_cleaned["text_stemmed"])
y = df_cleaned["fake"]
# Train a decision tree classifier using 5-fold-cross-validation
tree_model = DecisionTreeClassifier()
scores = cross_val_score(tree_model, X, y, cv=5)
# Print the cross-validation accuracy scores
print("Cross-Validation Accuracy Scores:")
for fold, score in enumerate(scores, start=1):
    print(f"Fold {fold}: {score}")
print("Average Accuracy: ", scores.mean())
Cross-Validation Accuracy Scores:
Fold 1: 0.8928292046936115
Fold 2: 0.8148631029986962
Fold 3: 0.8284224250325946
Fold 4: 0.8354413874038337
Fold 5: 0.8126222454035729
Average Accuracy:  0.8368356731064619
# Create a TF-IDF vectorizer to convert text into numerical features
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(4, 4))
X = vectorizer.fit_transform(df_cleaned["text_stemmed"])
y = df_cleaned["fake"]
# Train a decision tree classifier using 5-fold-cross-validation
tree_model = DecisionTreeClassifier()
scores = cross_val_score(tree_model, X, y, cv=5)
# Print the cross-validation accuracy scores
print("Cross-Validation Accuracy Scores:")
for fold, score in enumerate(scores, start=1):
    print(f"Fold {fold}: {score}")
print("Average Accuracy: ", scores.mean())
Cross-Validation Accuracy Scores:
Fold 1: 0.8447196870925684
Fold 2: 0.7940026075619296
Fold 3: 0.7646675358539765
Fold 4: 0.7469031164428217
Fold 5: 0.7426000782370583
Average Accuracy:  0.7785786050376708
# Create a TF-IDF vectorizer to convert text into numerical features
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(1, 4))
X = vectorizer.fit_transform(df_cleaned["text_stemmed"])
y = df_cleaned["fake"]
# Train a decision tree classifier using 5-fold-cross-validation
tree_model = DecisionTreeClassifier()
scores = cross_val_score(tree_model, X, y, cv=5)
# Print the cross-validation accuracy scores
print("Cross-Validation Accuracy Scores:")
for fold, score in enumerate(scores, start=1):
    print(f"Fold {fold}: {score}")
print("Average Accuracy: ", scores.mean())
Cross-Validation Accuracy Scores:
Fold 1: 0.9440677966101695
Fold 2: 0.8984354628422425
Fold 3: 0.9319426336375489
Fold 4: 0.913678445690442
Fold 5: 0.9158951623418959
Average Accuracy:  0.9208039002244599
We can now see that the accuracy declines as the n-gram size increases. The last option, ngram_range=(1, 4), is a variable range: the vectorizer extracts all n-grams from unigrams up to 4-grams, and the model can then split on any of them.
Compared with the plain unigram model, this variant scores about one percentage point worse in mean accuracy, though its best fold is only around 0.5 percentage points lower.
We conclude that, in this setting, the unigram model seems to be the way to go forward.
Compared to the LSVM TF-IDF model with 50,000 maximum features, which scored 92.0 percent accuracy in the study, we reached quite a satisfying accuracy, and more modern tooling and packages make the process seemingly easy.
The next main task is to illustrate the explainability that such a tree model offers. In the case of the Reuters giveaway above, we could already see one main advantage: we can rank the decision points and investigate why they matter. However, suspecting that values further down the tree might not make much sense, we first plot the decision tree at a larger size and look into possible paths.
#Best Model so far!
# Split the dataset into training and testing sets
train_data, test_data, train_labels, test_labels = train_test_split(df_cleaned['text_stemmed'], df_cleaned['fake'], test_size=0.2, random_state=15)
# Create a TF-IDF vectorizer to convert text into numerical features
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
train_features = vectorizer.fit_transform(train_data)
test_features = vectorizer.transform(test_data)
# Train a decision tree classifier
tree_model = DecisionTreeClassifier()
tree_model.fit(train_features, train_labels)
# Extract feature names from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Visualize the decision tree
tree.plot_tree(tree_model, feature_names=feature_names)
# Get the most important features for classification
importance = tree_model.feature_importances_
important_features = [(feature_names[i], importance[i]) for i in range(len(feature_names))]
important_features.sort(key=lambda x: x[1], reverse=True)
# Print the top 10 important features
print("Top 10 important features:")
for feature, importance in important_features[:10]:
    print(f"{feature}: {importance}")
Top 10 important features:
said: 0.3613277828063353
th: 0.17198717529726557
imag: 0.13947697039839485
thi: 0.023923389424487344
com: 0.016880920457508365
https: 0.014223935510835683
washington: 0.010960272475286654
minist: 0.010935200281421182
featur: 0.010899856336458196
watch: 0.01026647291480976
pred = tree_model.predict(test_features)
loc_test_labels = test_labels.reset_index()
sum_correct = 0
for i in range(len(pred)):  # iterate over all predictions (the earlier -1 skipped the last item)
    if pred[i] == loc_test_labels["fake"][i]:
        sum_correct += 1
sum_correct / len(pred)  # ~95.2% of test items predicted correctly
0.9517601043024772
from matplotlib import pyplot as plt
fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (30,30), dpi=600)
tree.plot_tree(tree_model, filled=True, feature_names=feature_names)
plt.savefig('tree.png', dpi=600)
plt.show()
The image "tree.png" is saved at a pretty high resolution, so one can inspect it manually. Each internal node is annotated with the word (feature) it splits on, together with a threshold condition on that single TF-IDF feature. If the condition holds, the left child is taken as the further path; otherwise the right child. A so-called "leaf node", a node with no further children, marks the final decision. The "samples" entry tells us how many training articles reached this node, and "value" indicates how many of those samples belong to each class (fake or not).
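Besides the plotted image, scikit-learn can also render a tree as plain text via export_text, which is sometimes easier to search than a large PNG; a minimal sketch on a toy tree (hypothetical data, not our trained model):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: the label simply equals the first feature
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
y = [0, 1, 1, 0]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each line shows a split condition or a leaf with its predicted class
rules = export_text(clf, feature_names=["word_a", "word_b"])
print(rules)
```

On our fitted model, export_text(tree_model, feature_names=list(feature_names)) would produce the same kind of searchable rule listing.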
So far we do not really know what the order of this "value" array means, since either class (fake or not fake) might come first. To find out, we can apply a technique that is used in many types of models to provide explainability, but which should be especially valuable for a tree model. Note that we will also explain what the feature values (TF-IDF scores) can mean.
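As a quick side note, the class order in the "value" array follows the classifier's classes_ attribute; a minimal sketch on a toy tree (hypothetical data, not our model):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0.0], [0.2], [0.8], [1.0]]
y = [0, 0, 1, 1]  # 0 = true news, 1 = fake, matching our label convention

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# classes_ gives the order in which counts appear in each node's 'value'
print(clf.classes_)  # [0 1] -> first entry counts label 0, second label 1
```

So for our model, tree_model.classes_ would tell us directly which position in "value" counts the fake articles.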
We first talked about stop words removal and stemming. The following text already went through those.
test_data[8232]
'new york - donald trump declar on wednesday that russia’ vladimir putin had been a better leader than u.s. presid barack obama, as the republican presidenti nomine use a televis forum to argu he wa best equip to reassert america’ global leadership. trump suggest at the event in which he and democrat rival hillari clinton made back-to-back appear that u.s. gener had been stymi by the polici of obama and clinton, who serv as the democrat president’ first secretari of state. “i think under the leadership of barack obama and hillari clinton the gener have been reduc to rubble. they have been reduc to a point that’ embarrass for our country,” trump said at nbc’ “commander-in-chief” forum in new york attend by militari veterans. it wa the first time trump and clinton had squar off on the same stage sinc accept their parties’ presidenti nomin in juli for the nov. 8 election. clinton wa grill over her handl of classifi inform while use a privat email server dure her tenur at the state department. fbi director jame comey had declar her “extrem careless” in her handl of sensit materi but did not recommend charg against her. “i did exactli what i should have done and i take it veri seriously, alway have, alway will,” she said.trump’ prais of putin and hi suggest that the unit state and russia form an allianc to defeat islam state milit could rais eyebrow among foreign polici expert who feel moscow is interf with effort to stem the syrian civil war. “if he say great thing about me, i’m go to say great thing about him,” trump said of the russian president. “certainli in that system, he’ been a leader, far more than our presid ha been.” trump had call obama “the founder of isis,” an acronym for islam state, in stump speech sever week ago. the statement drew broad criticism, prompt him to take a more disciplin approach to campaigning. he ha sinc pick up ground on clinton in nation opinion polls. 
trump also flirt with reveal what he had been learn in classifi intellig brief given to him by u.s. offici becaus he is the republican nominee. “there wa one thing that shock me,” trump said. “what i did learn is that our leadership, barack obama, did not follow what our expert ... said to do, and i wa very, veri surprised. ...our leader were not follow what they recommended.” earlier on wednesday, trump pledg to launch a new u.s. militari buildup, say america wa under threat like never befor from foe like islamist extremists, north korea and china. the event offer a prelud to how clinton and trump will deal with question on nation secur issu in their three upcom presidenti debat later in septemb and in october. clinton began the forum say her long experi in govern as a u.s. senat and secretari of state made her uniqu qualifi to serv as president. she said she had “an absolut rock steadiness” to be abl to make tough decisions, a not so subtl dig at trump, who democrat say is temperament unfit for the white house. moder matt lauer doggedli press her about her handl of email from a privat server while secretari of state from 2009 to 2013. the issu ha rais question about whether she can be trust to serv as president. clinton said none of the email she sent or receiv were mark top secret, secret or classified, the usual way such materi is identified. appear in the second half of the hour-long show, trump face question about hi fit for office. ask if he would be prepar on day one to be command in chief, trump said: “one hundr percent.” trump quickli abandon lauer’ entreati to avoid attack hi oppon and focu on what he would do if elect presid in november. “she’ been there for 30 years,” trump said. “we need change, and we need it fast.” the event brought togeth the meticul prepar clinton, 68, the wife of former presid bill clinton, and trump, 70, a new york businessman whose brash, freewheel style ha allow him to domin the headlin dure hi campaign. 
clinton said she regret her decis as a u.s. senat from new york to vote in favor of the much-critic 2003 iraq war and that trump had been in favor of it as well. trump ha condemn the war dure hi campaign and said he would avoid lengthi conflict in the middl east. on the u.s. intervent in libya in 2011, clinton reject trump’ critic of her support for the effort as secretari of state. “permit there to be an ongo civil war in libya would be as threaten and as danger as what we are see in syria,” she said. trump said clinton’ handl of libya prove disastrous. republican have made much of the fact that the u.s. ambassador to libya, chri stevens, wa kill in an islamist attack in benghazi, libya, in 2012. “she made a terribl mistak in libya,” said trump. clinton said u.s. polici under her leadership at the state depart had help promot security. “we made the world safer,” she said.'
Next comes feature extraction: in our case, the TF-IDF measure transforms each preprocessed article like the one above into a row of a sparse matrix. The study under evaluation explains TF-IDF with the following paragraph:
"The Term Frequency-Inverted Document Frequency (TF-IDF) is a weighting metric often used in information retrieval and natural language processing. It is a statistical metric used to measure how important a term is to a document in a dataset. A term importance increases with the number of times a word appears in the document, however, this is counteracted by the frequency of the word in the corpus. One of the main characteristics of IDF is it weights down the term frequency while scaling up the rare ones. For example, words such as “the” and “then” often appear in the text, and if we only use TF, terms such as these will dominate the frequency count. However, using IDF scales down the impact of these terms."
print(test_features[0])
(0, 35) 0.04808872773051405 (0, 41) 0.04013522497697604 (0, 43) 0.03850084434050053 (0, 44) 0.03515658570186762 (0, 45) 0.03556124969086618 (0, 71) 0.031482235510104956 (0, 115) 0.05172944953379566 (0, 117) 0.04106598582655754 (0, 134) 0.04349846118999503 (0, 143) 0.030728138071455056 (0, 150) 0.03474201847174634 (0, 159) 0.0331720549924528 (0, 245) 0.03016442970954729 (0, 281) 0.04174231216569779 (0, 285) 0.025222406818486965 (0, 293) 0.0676129980676364 (0, 296) 0.03749581960914753 (0, 301) 0.04850264006562163 (0, 357) 0.0525913956603475 (0, 365) 0.03664933994080295 (0, 380) 0.03462630570335908 (0, 403) 0.022196761245201285 (0, 432) 0.048922865351767056 (0, 435) 0.03490436801466689 (0, 467) 0.07033644908727302 : : (0, 4550) 0.01782057367880211 (0, 4559) 0.03496105576688818 (0, 4580) 0.038603681848526694 (0, 4644) 0.32116097700980184 (0, 4645) 0.03843295076684736 (0, 4702) 0.051614460217132486 (0, 4711) 0.051576463510497945 (0, 4712) 0.01932643051458175 (0, 4732) 0.047521893599842796 (0, 4743) 0.04079801302394682 (0, 4745) 0.039210400378089906 (0, 4774) 0.04515675771372111 (0, 4783) 0.054008025321505046 (0, 4813) 0.03839920033146977 (0, 4819) 0.02293406312585752 (0, 4827) 0.08053518068821988 (0, 4841) 0.108063831097098 (0, 4862) 0.021616551055479404 (0, 4875) 0.048023702740727425 (0, 4876) 0.020589801354013677 (0, 4898) 0.02105611168409977 (0, 4903) 0.036614926430348985 (0, 4946) 0.024008593707264155 (0, 4976) 0.02774573378634365 (0, 4983) 0.10779853296321389
Each line of this output is a (document, feature index) pair followed by the TF-IDF weight of that term in the document. The feature index can be mapped back to the underlying word via the vectorizer's vocabulary, which lets us establish which terms characterize the article.
tree_model.predict(test_features[0])
array([0])
Above you can see what our model predicts for this sparse row as input: it classifies the article as real news (0 = real news, 1 = fake news). The column indices correspond to feature IDs, which can be translated back into words (as we did in the tree model visualization).
len(feature_names) #This is 5000 long because we established max_features = 5000
5000
feature_names
array(['00', '000', '10', ..., 'zone', 'zor', 'zuma'], dtype=object)
print(tree_model.decision_path(test_features[0]))
(0, 0) 1 (0, 648) 1 (0, 649) 1 (0, 650) 1 (0, 651) 1 (0, 652) 1 (0, 653) 1 (0, 654) 1 (0, 655) 1 (0, 656) 1 (0, 657) 1 (0, 658) 1 (0, 659) 1 (0, 660) 1 (0, 661) 1 (0, 662) 1 (0, 663) 1 (0, 664) 1 (0, 665) 1 (0, 666) 1 (0, 667) 1 (0, 668) 1 (0, 669) 1 (0, 670) 1 (0, 671) 1 : : (0, 756) 1 (0, 757) 1 (0, 758) 1 (0, 759) 1 (0, 760) 1 (0, 761) 1 (0, 762) 1 (0, 763) 1 (0, 764) 1 (0, 765) 1 (0, 766) 1 (0, 767) 1 (0, 768) 1 (0, 769) 1 (0, 770) 1 (0, 771) 1 (0, 772) 1 (0, 773) 1 (0, 774) 1 (0, 775) 1 (0, 776) 1 (0, 777) 1 (0, 778) 1 (0, 779) 1 (0, 780) 1
n_nodes = tree_model.tree_.node_count
children_left = tree_model.tree_.children_left
children_right = tree_model.tree_.children_right
feature = tree_model.tree_.feature
threshold = tree_model.tree_.threshold

X_test = test_features
node_indicator = tree_model.decision_path(X_test)
leaf_id = tree_model.apply(X_test)

sample_id = 2
# obtain ids of the nodes `sample_id` goes through, i.e., row `sample_id`
node_index = node_indicator.indices[
    node_indicator.indptr[sample_id] : node_indicator.indptr[sample_id + 1]
]

print("Rules used to predict sample {id}:\n".format(id=sample_id))
for node_id in node_index:
    # continue to the next node if it is a leaf node
    if leaf_id[sample_id] == node_id:
        continue

    # check if value of the split feature for sample `sample_id` is below threshold
    if X_test[sample_id, feature[node_id]] <= threshold[node_id]:
        threshold_sign = "<="
    else:
        threshold_sign = ">"

    print(
        "decision node {node} : (X_test[{sample}, {feature}] = {value}) "
        "{inequality} {threshold})".format(
            node=node_id,
            sample=sample_id,
            feature=feature[node_id],
            value=X_test[sample_id, feature[node_id]],
            inequality=threshold_sign,
            threshold=threshold[node_id],
        )
    )
Rules used to predict sample 2: decision node 0 : (X_test[2, 3909] = 0.08257088598465868) > 0.03112895507365465) decision node 648 : (X_test[2, 4509] = 0.0) <= 0.004100584425032139) decision node 649 : (X_test[2, 2244] = 0.0) <= 0.02063043136149645) decision node 650 : (X_test[2, 2199] = 0.0) <= 0.007333007175475359) decision node 651 : (X_test[2, 4764] = 0.0) <= 0.035441914573311806) decision node 652 : (X_test[2, 1358] = 0.0) <= 0.038928279653191566) decision node 653 : (X_test[2, 3344] = 0.0) <= 0.04044116474688053) decision node 654 : (X_test[2, 1985] = 0.0) <= 0.03848862834274769) decision node 655 : (X_test[2, 4055] = 0.0) <= 0.015775397419929504) decision node 656 : (X_test[2, 1485] = 0.0) <= 0.017427759245038033) decision node 657 : (X_test[2, 3634] = 0.0) <= 0.04243004880845547) decision node 658 : (X_test[2, 4894] = 0.0) <= 0.010134678333997726) decision node 659 : (X_test[2, 2493] = 0.0) <= 0.06964515894651413) decision node 660 : (X_test[2, 3738] = 0.0) <= 0.0507810153067112) decision node 661 : (X_test[2, 232] = 0.0) <= 0.016683651134371758) decision node 662 : (X_test[2, 1090] = 0.0) <= 0.06726432964205742) decision node 663 : (X_test[2, 769] = 0.0) <= 0.017349570989608765) decision node 664 : (X_test[2, 1959] = 0.0) <= 0.01268040295690298) decision node 665 : (X_test[2, 1137] = 0.0) <= 0.05931176617741585) decision node 666 : (X_test[2, 4856] = 0.0) <= 0.03959180973470211) decision node 667 : (X_test[2, 355] = 0.0) <= 0.01828521490097046) decision node 668 : (X_test[2, 2148] = 0.0) <= 0.1901034712791443) decision node 669 : (X_test[2, 4182] = 0.0) <= 0.05524014122784138) decision node 670 : (X_test[2, 2073] = 0.0) <= 0.08285689726471901) decision node 671 : (X_test[2, 709] = 0.0) <= 0.02678176946938038) decision node 672 : (X_test[2, 2524] = 0.0) <= 0.058523016050457954) decision node 673 : (X_test[2, 4955] = 0.0) <= 0.0655912458896637) decision node 674 : (X_test[2, 2658] = 0.05010618757302066) <= 0.15058037638664246) decision node 675 : (X_test[2, 
2735] = 0.0) <= 0.06458630785346031) decision node 676 : (X_test[2, 1920] = 0.0) <= 0.07530393078923225) decision node 677 : (X_test[2, 2960] = 0.0) <= 0.11118507385253906) decision node 678 : (X_test[2, 3974] = 0.0) <= 0.040849439799785614) decision node 679 : (X_test[2, 3565] = 0.0) <= 0.0705627016723156) decision node 680 : (X_test[2, 742] = 0.0) <= 0.10862131416797638) decision node 681 : (X_test[2, 1119] = 0.0) <= 0.037847088649868965) decision node 682 : (X_test[2, 708] = 0.0) <= 0.09091030433773994) decision node 683 : (X_test[2, 677] = 0.0) <= 0.11146101728081703) decision node 684 : (X_test[2, 3618] = 0.0) <= 0.06239195354282856) decision node 685 : (X_test[2, 858] = 0.0) <= 0.21910104900598526) decision node 686 : (X_test[2, 4786] = 0.0) <= 0.2799434959888458) decision node 687 : (X_test[2, 3240] = 0.0) <= 0.2579064592719078) decision node 688 : (X_test[2, 4054] = 0.0) <= 0.41695553064346313) decision node 689 : (X_test[2, 4064] = 0.0) <= 0.27712124586105347) decision node 690 : (X_test[2, 1564] = 0.0) <= 0.35001419484615326) decision node 691 : (X_test[2, 131] = 0.0) <= 0.11184950545430183) decision node 692 : (X_test[2, 1799] = 0.0) <= 0.023648111149668694) decision node 693 : (X_test[2, 3289] = 0.0) <= 0.6188528835773468) decision node 694 : (X_test[2, 1743] = 0.07470436825094597) <= 0.1328761950135231) decision node 695 : (X_test[2, 733] = 0.0) <= 0.07040472328662872) decision node 696 : (X_test[2, 530] = 0.0) <= 0.2118307650089264) decision node 697 : (X_test[2, 2965] = 0.0) <= 0.10029597207903862) decision node 698 : (X_test[2, 2125] = 0.0) <= 0.0889420434832573) decision node 699 : (X_test[2, 2779] = 0.0) <= 0.27493709325790405) decision node 700 : (X_test[2, 1037] = 0.0) <= 0.3406751751899719) decision node 701 : (X_test[2, 1326] = 0.0) <= 0.18150590360164642) decision node 702 : (X_test[2, 121] = 0.0) <= 0.12437033280730247) decision node 703 : (X_test[2, 271] = 0.0) <= 0.07590512186288834) decision node 704 : (X_test[2, 1855] = 0.0) <= 
0.29999177157878876) decision node 705 : (X_test[2, 2963] = 0.0) <= 0.16315537691116333) decision node 706 : (X_test[2, 2800] = 0.0) <= 0.08704488351941109) decision node 707 : (X_test[2, 4009] = 0.0) <= 0.044869232922792435) decision node 708 : (X_test[2, 4651] = 0.0) <= 0.6769784390926361) decision node 709 : (X_test[2, 4637] = 0.0) <= 0.2396111637353897) decision node 710 : (X_test[2, 2299] = 0.0) <= 0.33450592309236526) decision node 711 : (X_test[2, 2236] = 0.0) <= 0.2921052575111389) decision node 712 : (X_test[2, 1439] = 0.0) <= 0.20616884157061577) decision node 713 : (X_test[2, 3531] = 0.0) <= 0.21827290952205658) decision node 714 : (X_test[2, 3118] = 0.06198541761786403) <= 0.22246916592121124) decision node 715 : (X_test[2, 2190] = 0.0) <= 0.3856894075870514) decision node 716 : (X_test[2, 3822] = 0.0) <= 0.30308666825294495) decision node 717 : (X_test[2, 2304] = 0.0) <= 0.2595519423484802) decision node 718 : (X_test[2, 3795] = 0.0) <= 0.465387761592865) decision node 719 : (X_test[2, 2739] = 0.0) <= 0.38565194606781006) decision node 720 : (X_test[2, 1165] = 0.0) <= 0.24729330092668533) decision node 721 : (X_test[2, 901] = 0.0) <= 0.08920750766992569) decision node 722 : (X_test[2, 4915] = 0.0) <= 0.36934812366962433) decision node 723 : (X_test[2, 3133] = 0.0) <= 0.23777402937412262) decision node 724 : (X_test[2, 3399] = 0.059269623923076456) <= 0.25794798135757446) decision node 725 : (X_test[2, 3309] = 0.0) <= 0.3814828395843506) decision node 726 : (X_test[2, 1981] = 0.0) <= 0.2419453263282776) decision node 727 : (X_test[2, 3529] = 0.0) <= 0.3781767413020134) decision node 728 : (X_test[2, 2881] = 0.0) <= 0.47116148471832275) decision node 729 : (X_test[2, 2683] = 0.0) <= 0.27298273146152496) decision node 730 : (X_test[2, 3466] = 0.0) <= 0.5025071501731873) decision node 731 : (X_test[2, 4391] = 0.0) <= 0.22987641394138336) decision node 732 : (X_test[2, 4893] = 0.0) <= 0.1964850127696991) decision node 733 : (X_test[2, 875] = 0.0) <= 
0.24781974405050278) decision node 734 : (X_test[2, 573] = 0.0) <= 0.23447566106915474) decision node 735 : (X_test[2, 3235] = 0.0) <= 0.6826426982879639) decision node 736 : (X_test[2, 316] = 0.0) <= 0.24823792278766632) decision node 737 : (X_test[2, 4827] = 0.05193740068873335) <= 0.2624070942401886) decision node 738 : (X_test[2, 3408] = 0.0) <= 0.2446560636162758) decision node 739 : (X_test[2, 3670] = 0.0) <= 0.4789866805076599) decision node 740 : (X_test[2, 952] = 0.0) <= 0.23744390532374382) decision node 741 : (X_test[2, 849] = 0.0) <= 0.2791730463504791) decision node 742 : (X_test[2, 878] = 0.0) <= 0.4723604619503021) decision node 743 : (X_test[2, 4525] = 0.03325251020874442) <= 0.34229470789432526) decision node 744 : (X_test[2, 453] = 0.0) <= 0.5746106803417206) decision node 745 : (X_test[2, 52] = 0.0) <= 0.03588072583079338) decision node 746 : (X_test[2, 1495] = 0.0) <= 0.5142679810523987) decision node 747 : (X_test[2, 2899] = 0.0) <= 0.2235613614320755) decision node 748 : (X_test[2, 2485] = 0.0) <= 0.2314666137099266) decision node 749 : (X_test[2, 3724] = 0.0) <= 0.19017504155635834) decision node 750 : (X_test[2, 738] = 0.0) <= 0.3433455228805542) decision node 751 : (X_test[2, 2438] = 0.0) <= 0.0835178792476654) decision node 752 : (X_test[2, 667] = 0.0) <= 0.0697844997048378) decision node 753 : (X_test[2, 3986] = 0.0) <= 0.2876126170158386) decision node 754 : (X_test[2, 852] = 0.0) <= 0.14642544090747833) decision node 755 : (X_test[2, 1921] = 0.0) <= 0.11007019877433777) decision node 756 : (X_test[2, 2748] = 0.0) <= 0.17178154736757278) decision node 757 : (X_test[2, 1829] = 0.0) <= 0.2751568853855133) decision node 758 : (X_test[2, 3479] = 0.0) <= 0.15366841107606888) decision node 759 : (X_test[2, 610] = 0.0) <= 0.4208744913339615) decision node 760 : (X_test[2, 2852] = 0.0) <= 0.14081858098506927) decision node 761 : (X_test[2, 3665] = 0.0) <= 0.3083181232213974) decision node 762 : (X_test[2, 1446] = 0.0) <= 0.38813072443008423) 
decision node 763 : (X_test[2, 2798] = 0.0) <= 0.7539668381214142) decision node 764 : (X_test[2, 1651] = 0.0) <= 0.19380173832178116) decision node 765 : (X_test[2, 1912] = 0.0) <= 0.07279519364237785) decision node 766 : (X_test[2, 3317] = 0.0) <= 0.5230422019958496) decision node 767 : (X_test[2, 1934] = 0.0) <= 0.35934072732925415) decision node 768 : (X_test[2, 640] = 0.0) <= 0.10842181742191315) decision node 769 : (X_test[2, 4060] = 0.0) <= 0.13149722665548325) decision node 770 : (X_test[2, 3336] = 0.0) <= 0.28887443244457245) decision node 771 : (X_test[2, 2210] = 0.0) <= 0.4919208586215973) decision node 772 : (X_test[2, 4494] = 0.0) <= 0.08604681864380836) decision node 773 : (X_test[2, 781] = 0.0) <= 0.44687725603580475) decision node 774 : (X_test[2, 1318] = 0.0) <= 0.12311891466379166) decision node 775 : (X_test[2, 2202] = 0.0) <= 0.08503024652600288) decision node 776 : (X_test[2, 1333] = 0.0) <= 0.21778281778097153) decision node 777 : (X_test[2, 2219] = 0.0) <= 0.054079363122582436) decision node 778 : (X_test[2, 4366] = 0.0) <= 0.11365454643964767) decision node 779 : (X_test[2, 1976] = 0.0) <= 0.07552583888173103)
This code, adapted from https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html, sheds light on the decision path of sample 2 from X_test: for every internal node on the path it prints the tested feature index, the sample's value for that feature, and the split threshold.
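The printed rules can also be annotated with the term behind each feature index. A minimal self-contained sketch of the idea, using a tiny random dataset and hypothetical term names rather than our actual model:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for our TF-IDF matrix: 20 documents, 4 hypothetical terms.
rng = np.random.RandomState(0)
X = rng.rand(20, 4)
y = (X[:, 2] > 0.5).astype(int)
names = np.array(["clinton", "obama", "trump", "washington"])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

sample_id = 0
node_indicator = clf.decision_path(X[sample_id : sample_id + 1])
leaf_id = clf.apply(X[sample_id : sample_id + 1])[0]

for node_id in node_indicator.indices:
    if node_id == leaf_id:
        print(f"leaf {node_id}: class = {clf.predict(X[sample_id:sample_id+1])[0]}")
        continue
    feat = clf.tree_.feature[node_id]
    thresh = clf.tree_.threshold[node_id]
    sign = "<=" if X[sample_id, feat] <= thresh else ">"
    # Annotate with the term name instead of the bare feature index.
    print(f"node {node_id}: {names[feat]} = {X[sample_id, feat]:.3f} {sign} {thresh:.3f}")
```

With our real model, `feature_names[feature[node_id]]` would play the role of `names[feat]`.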
Rather than annotating the textual output any further, the path is easier to read when visualized as a graph. Remember that we plotted the whole tree before, so contrasting the two may be useful: we render the full tree with the nodes the sample passes through highlighted in bright green, following https://stackoverflow.com/questions/55878247/how-to-display-the-path-of-a-decision-tree-for-test-samples (with adaptations). We will reuse the same code for another sample further below.
import pydotplus

dot_data = tree.export_graphviz(tree_model, out_file=None,
                                feature_names=feature_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)

samples = X_test[514]
decision_paths = tree_model.decision_path(samples)

for decision_path in decision_paths:
    for n, node_value in enumerate(decision_path.toarray()[0]):
        if node_value == 0:
            continue
        node = graph.get_node(str(n))[0]
        node.set_fillcolor('green')
        labels = node.get_attributes()['label'].split('<br/>')
        for i, label in enumerate(labels):
            if label.startswith('samples = '):
                labels[i] = 'samples = {}'.format(int(label.split('=')[1]) + 1)
        node.set('label', '<br/>'.join(labels))

import graphviz
from IPython.display import Image

graph.write_png("decision_path_highlighted.png")
Image(graph.create_png())
You can inspect the highlighted decision path in decision_path_highlighted.png; the image resolution is too high to display comfortably inside the notebook.
We traced the article text back to https://www.reuters.com/article/usa-trump-idUKL1N1NX1MC. It is a true article (although a somewhat absurd one).
test_data[514]
'washington - presid donald trump said there wa a “pocahontas” in the u.s. congress dure a meet on monday with nativ american world war two veteran in an appar derogatori refer to democrat senat elizabeth warren of massachusetts. after listen to one veteran speak at length about hi experi as a “navajo code talker” dure the war, trump heap prais on the veteran and said he would not give prepar remark himself. “you were here long befor ani of us were here,” trump said. “although we have a repres in congress who they say wa here a long time ago. they call her pocahontas.” trump repeatedli refer to warren as “pocahontas,” the name of a famou 17th-centuri nativ american, dure hi presidenti campaign in a mock refer to warren’ have said in the past that she had nativ american ancestry. warren, one of the senate’ most promin liber democrats, is a note legal scholar who taught at harvard law school and serv as an advis to former presid barack obama befor she wa elect to the senat in 2012. “it is deepli unfortun that the presid of the unit state cannot even make it through a ceremoni honor these hero without have to throw out a racial slur,” warren said on msnbc. white hous spokeswoman sarah sander disput the character of trump’ remark as a racial slur. “i think what most peopl find offens is senat warren lie about her heritag to advanc her career,” sander told reporters. jefferson keel, presid of the nation congress of american indians, question the “use of the name pocahonta as a slur ... onc again, we call upon the presid to refrain from use her name in a way that denigr her legacy.” trump’ comment immedi trend on social media. the word “pocahontas” appear 12 time on twitter everi second, accord to social media analyt compani zoomph. trump’ knock at warren came as hi administr is embroil in controversi over the consum financi protect board, which warren help develop befor enter politics. 
the agency, set up to protect american from abus lend practic after the financi crisis, ha been under attack by trump sinc he took offic in january. on friday, trump name hi budget director as the interim head of the agency, after it outgo chief name someon els to the job, set up a court battle.'
test_labels[514]
0
Looking at the green path, we can see that we selected a sample with what is probably the longest path in the tree. Its leaf predicts class 0, i.e. real news: in each node, value = [first, second] gives the number of training samples in that node, where the first entry counts real news and the second counts fake news. The model therefore classified this sample correctly. Let's look at another example now.
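Reading those value pairs programmatically works via the `tree_.value` attribute of the fitted classifier. A self-contained toy sketch (random data standing in for our TF-IDF features; the same idea applies to `tree_model` and `X_test[514]`):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in: 30 documents, 5 features, label from one feature.
rng = np.random.RandomState(1)
X = rng.rand(30, 5)
y = (X[:, 0] > 0.5).astype(int)  # 0 = real news, 1 = fake news (our convention)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Find the leaf that the first sample lands in and read its class distribution.
leaf = clf.apply(X[:1])[0]
# [real, fake] per class; raw counts in older scikit-learn versions,
# weighted fractions in recent ones -- argmax is the predicted class either way.
counts = clf.tree_.value[leaf][0]
print(f"leaf {leaf}: value = {counts}, predicted class = {counts.argmax()}")
```

The predicted class is simply the majority class of the leaf, which is why a pure leaf such as value = [n, 0] always yields real news.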
import pydotplus

dot_data = tree.export_graphviz(tree_model, out_file=None,
                                feature_names=feature_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)

samples = X_test[700]
decision_paths = tree_model.decision_path(samples)

for decision_path in decision_paths:
    for n, node_value in enumerate(decision_path.toarray()[0]):
        if node_value == 0:
            continue
        node = graph.get_node(str(n))[0]
        node.set_fillcolor('green')
        labels = node.get_attributes()['label'].split('<br/>')
        for i, label in enumerate(labels):
            if label.startswith('samples = '):
                labels[i] = 'samples = {}'.format(int(label.split('=')[1]) + 1)
        node.set('label', '<br/>'.join(labels))

import graphviz
from IPython.display import Image

graph.write_png("decision_path_highlighted_2.png")
Image(graph.create_png())